Application/Control Number: 10/082,893

Art Unit: 2112

(1) Real Party in Interest

A statement identifying by name the real party in interest is contained in the brief. 

(2) Related Appeals and Interferences 

The examiner is not aware of any related appeals, interferences, or judicial proceedings 
which will directly affect or be directly affected by or have a bearing on the Board's 
decision in the pending appeal. 

(3) Status of Claims 

The statement of the status of claims contained in the brief is correct. 

(4) Status of Amendments 

All amendments have been entered.

(5) Summary of Claimed Subject Matter 

The summary of claimed subject matter contained in the brief is correct. 

(6) Grounds of Rejection to Be Reviewed on Appeal

The statement of the grounds of rejection to be reviewed on appeal contained in the
brief is correct.

(7) Claims Appendix 

The copy of the appealed claims contained in the Appendix to the brief is correct. 

(8) Evidence Relied Upon

US 6,324,609 Davis et al. 11-2001

Article 'Gigabit Ethernet PCI Adapter Performance' by Mitchell L. Loeb, IBM Corporation 

(9) Grounds of Rejection 

The following ground(s) of rejection are applicable to the appealed claims: 

Claims 1-30 are rejected under 35 U.S.C. 102(e) as being anticipated by Davis et al.
(US Patent 6,324,609).

As for claims 1, 11, 21 and 26, Davis teaches a method and apparatus comprising:
transferring data from a host memory to an Ethernet device (see figures 1 and 7,
column 1 lines 30-32, and column 14 line 27 to column 15 line 25, wherein the host
processor 3 sends out a Type 1 configuration to bridge 29 to find all devices connected
to the secondary bus 15 in order to load software drivers to control these devices); and
processing the data without sending the data from the host memory to an embedded
memory associated with an adapter that includes the Ethernet device (see column 14
line 27 to column 15 line 25, wherein bridge 29, which includes the I/O processor 5,
processes the Type 1 configuration by converting it to the Type 0 configuration as
disclosed in figure 7).

As for claims 2-4, 12-14, 22 and 27, Davis teaches forming protocol headers
in the embedded memory (see column 5 line 44 to column 6 line 5 and table 1).

As for claims 5 and 15, Davis teaches computing checksums in firmware and in the 
Ethernet device (see column 13 lines 65-67). 

As for claims 6-9, 16-19, 23-25 and 28-30, Davis teaches determining whether data in 
said host memory is larger than an Ethernet maximum transmit unit (see column 15 line 
25 to column 16 line 17). 

As for claims 10 and 20, Davis teaches detecting the address of an access request from 
an Ethernet device and routing said request to the host memory or embedded memory 
based on the address (see column 16 lines 63-67). 

(10) Response to Argument 

Appellant's Brief filed on 4/11/06 has been fully considered but does not place
the application in condition for allowance.

1. Appellant argues that there is no discussion in the reference of any transfer of
data from a host memory, and that treating the sending of configuration commands to
the I/O processor, or anything else, as some kind of providing of data is simply
baseless. Examiner respectfully disagrees. At column 1 lines 27-38, Davis clearly
teaches that drivers are loaded into the memory of the host processor and that address
space must be allocated to those devices connected to the PCI bus. Further, in the
example provided at column 1 lines 39-50, Davis teaches the host processor sending a
Type 1 command to the bridge to find out which devices are connected to the PCI bus
so that it can load drivers to these devices. These facts anticipate what is claimed,
i.e., the host processor includes a host memory for storing device drivers, and the
device drivers are loaded to those devices that are connected to the PCI bus.
Furthermore, column 14 line 27 to column 15 line 25 describes the configuration for
loading device drivers to the devices connected to the PCI bus. The devices connected
to the PCI bus 15 in figure 1 of Davis are considered to be Ethernet devices as
claimed, equivalent to those shown in figure 1 of the appellant's specification; and,
unless otherwise specified, the configuration data is a type of data. Thus, the prior
art teaches the invention as claimed, the claims do not distinguish over the prior art
as applied, and Appellant's position is not seen to be persuasive towards patentability.

2. Appellant argues that there is absolutely no mention in the reference of any
Ethernet device. Examiner respectfully disagrees. Davis, at column 14 line 27 to
column 15 line 25, describes the configuration for loading device drivers to the devices
connected to the PCI bus. The PCI devices connected to the PCI bus 15 in figure 1 of
Davis are considered to be Ethernet devices as claimed, equivalent to those shown in
figure 1 of the appellant's specification. Examiner further cites the article
"Gigabit Ethernet PCI Adapter Performance", which verifies that a PCI device can be
an Ethernet device. Thus, the prior art teaches the invention as claimed, the claims
do not distinguish over the prior art as applied, and Appellant's position is not
seen to be persuasive towards patentability.

3. In response to appellant's argument that the references fail to show certain 
features of applicant's invention, it is noted that the features upon which applicant relies 
(i.e., how if data was transferred to the Ethernet device, that data could be transferred 
without transferring it to an internal memory of the Ethernet device) are not recited in the 
rejected claim(s). Although the claims are interpreted in light of the specification, 
limitations from the specification are not read into the claims. See In re Van Geuns, 988 
F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993). 

(11) Related proceeding(s) Appendix 

No decision rendered by a court or the Board is identified by the Examiner in the 
Related Appeals and Interferences section of this Examiner's answer. 

For the above reasons, it is believed that the rejections should be sustained. 
Respectfully Submitted, 

Gigabit Ethernet PCI Adapter Performance

Mitchell L. Loeb, Andrew J. Rindos, William G. Holland, and Steven P. Woolet

IBM Corporation 



Abstract 

In this article we analyze the performance of a commercially available Gigabit
Ethernet adapter, using TCP/IP running on a Windows NT operating system. We show
how performance varies with number of clients, sessions per client, number of system
processors, and speed of system processors. Throughout, we discuss how the
Ethernet protocol, adapter hardware, operating system, and device driver interact to
produce the throughput results measured, and suggest ways to improve performance.



Even before finalization of the Gigabit Ethernet standards
over fiber and twin-axial cable in June 1998
(IEEE 802.3z, since incorporated into [1]), Gigabit 
Ethernet adapters were commercially available from 
several vendors. Improvements to these earliest adapters have 
been continuous over the intervening years, even as the latest
Gigabit Ethernet standard (IEEE 802.3ab for category 5 copper
cable [2]) was being finalized (June 1999). In the world of
computer networking, Gigabit Ethernet therefore represents a 
relatively mature technology. It was developed to provide a 
high-capacity Ethernet-based network backbone that allowed 
for the aggregation of lower-speed (10 and 100 Mb/s) Ether- 
net LAN traffic. It was also developed to provide significantly 
greater throughput capacity to large network servers, increas- 
ing their ability to service ever larger numbers of clients in a 
timely manner. While the standards support Gigabit Ethernet 
in both full-duplex (switched) and half-duplex (shared) mode, 
the performance of half-duplex Gigabit Ethernet is severely 
limited. We therefore focus only on full-duplex performance 
in this article. 

Because Gigabit Ethernet adapters have been available for 
so long, there have been a number of papers that have 
explored their performance (e.g., [3-6]). Early adapters 
examined in [3] showed difficulty in achieving more than 
200-300 Mb/s for single-session (single client/single server) 
data transfers, while multiple client-to-server aggregate 
throughputs rarely exceeded 500 Mb/s. Other testers have 
reported similar throughputs. These early performance num- 
bers are far below the promise of gigabit-per-second trans- 
fers. In this article we therefore first revisit these throughput 
tests, to determine if and how performance has improved 
with recent advances in hardware and device driver software. 
We specifically examine the current performance of the 
IBM™ Netfinity™ Gigabit Ethernet SX adapter in high-performance
IBM Netfinity 6000 and 7000 servers. Because of
their predominance within today's networks and workstations, 
the protocol and operating systems tested were, respectively, 
Transmission Control Protocol/Internet Protocol (TCP/IP) 
and Microsoft™ Windows NT™. 

As our data will show, the Gigabit Ethernet adapter can 
now achieve approximately three-fourths of its stated media 
speed. 1 Therefore, in the remainder of the article we examine 
those factors that account for the observed difference 
between the Gigabit Ethernet media speed and the measured 
adapter throughput. We first discuss the various overheads as 
defined by the Ethernet physical and higher protocol layer 
standards (which include certain required headers, an inter- 
frame gap, etc.), all of which combine to reduce the maxi- 
mum achievable throughput. We then carefully analyze 
various system bottlenecks. We will show that the limitations 
are due to the server software and hardware limitations, and 
not the network adapter. 

Measurements 

Figure 1 shows the configuration used to measure Gigabit 
Ethernet adapter performance. An IBM Netfinity Gigabit 
Ethernet SX adapter (based on Intel™ 82542 technology) was 
installed in one of the 64-bit, 33 MHz peripheral component 
interconnect (PCI) [7] slots in an IBM Netfinity 7000 M10 or 
6000R server. The 7000 M10 server was equipped with four 
550 MHz Intel Pentium™ II Xeon™ processors, and the 
6000R server had four 700 MHz Intel Pentium III Xeon pro- 
cessors. The 6000R was used to make all of the measurements 
for the 700 MHz processor data points. By changing the clock 
speed jumpers on the 7000 M10, that server was made to 
operate as if it had either 400, 450, 500, or 550 MHz proces- 
sors. The device driver was IBMGENT4.SYS (size 35 kbytes, 
dated 3/10/99). 

The Gigabit Ethernet adapter was connected by optical 
fiber to a gigabit port on an Intel 5101™ Ethernet switch. The 
switch also had 10/100 Mb/s Ethernet ports. Eleven of these 



1 These measurements were made between January and March 2000. 
Faster system processors should show improved throughputs for the 
adapters. 

Figure 1. Configuration of the measurement setup.

Figure 2. Throughput in the 700 MHz server as a function of
the number of clients and sessions per client.



ports were used to connect 100 Mb/s Ethernet clients to the 
server, using twisted pair copper cabling. All adapters and 
switch ports were configured for full-duplex operation, mean- 
ing that frames could be simultaneously transmitted and 
received on any connection. The client systems were IBM 
PC300GL desktop computers, each with one 300 MHz Pentium II
processor. An IBM 10/100 Etherjet™ PCI adapter was
installed in a 32-bit, 33 MHz PCI slot in each client. The 
servers used Microsoft Windows NT 4.0 with Service Pack 5 
as the operating system and Microsoft TCP/IP as the trans- 
mission protocol. 

Throughput was measured using Ganymede™ Corp.'s 
Chariot™ software tool (v. 3.1). One of the 11 clients was 
used as the controller, and thus did not transfer any file data 
to the server. The Chariot script called filercvl (file receive 
long) was specifically used to simulate the server requesting, 
and then receiving, a series of 100,000-byte files from each of 
the clients participating in the test. This is a memory-to-mem- 
ory transfer, and the data is not actually written to disk. The 
length of each test was variable, with most tests transferring 
data for at least one minute. Clients were added one at a time 
in order to increase the load on the gigabit adapter in the 
server. In addition, the server could request more than one 
file at a time from each client through the use of multiple 
TCP/IP sessions. 

Figure 2 shows the throughput of the gigabit adapter in the 
6000R server as a function of the number of clients, while 
receiving a series of 100,000-byte files from each client. The 
bottom line shows the single-session throughput of the gigabit 
adapter as the number of clients is increased from 1 to 10. 
The middle line shows the throughput when each client is 
running two TCP/IP sessions with the server. The top line is 
the maximum throughput when the number of sessions per 
client was increased until the performance peaked. In this 
particular case, it never required more than six simultaneous 
sessions on each client to saturate the server. The best 
throughput achieved was 754 Mb/s (using eight clients with six 
sessions each). Notice that the throughput rises linearly from 
1 to 8 clients, with each client approximately realizing the full 
theoretical limit of 100 Mb/s Ethernet, 94.9 Mb/s (discussed 
later). After the eighth client, the server becomes the bottle- 
neck, so adding additional clients yields no increase in 
throughput. 
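
The shape of the top curve in Figure 2 can be approximated with a simple saturation model. The sketch below is only an illustration: it assumes each client delivers the full 94.9 Mb/s and that the server caps the aggregate at the measured 754 Mb/s. The function and constant names are hypothetical, and the model ignores the per-session effects visible in the measurements.

```python
# Saturation model for aggregate throughput (an illustrative assumption,
# not the authors' model): each client contributes up to the 100 Mb/s
# Ethernet data limit until the server's measured peak is reached.

CLIENT_LIMIT_MBPS = 94.9   # theoretical data rate of one 100 Mb/s client
SERVER_PEAK_MBPS = 754.0   # best throughput measured on the 700 MHz server


def modeled_throughput(num_clients: int) -> float:
    """Return the modeled aggregate throughput in Mb/s."""
    return min(num_clients * CLIENT_LIMIT_MBPS, SERVER_PEAK_MBPS)


if __name__ == "__main__":
    for n in range(1, 11):
        print(f"{n:2d} clients: {modeled_throughput(n):6.1f} Mb/s")
```

Under these assumptions the curve flattens at the eighth client, since eight clients at 94.9 Mb/s would exceed the server's measured peak.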

Figure 3 shows the effect of multiple processors on the 
throughput as measured in the 7000 M10 running at the 550 
MHz processor clock speed. A single processor is unable to 
produce the best throughput for this server. Once a second 
processor is added, however, adding additional processors 
does not improve the throughput. This can also be shown by 
observing processor utilizations by using the Windows NT
Performance Monitor. When using 7 clients with 7 sessions
each, one of the system processors was running at almost 100
percent utilization, while the other processors were running at
approximately 10 percent each.

Figure 4. Throughput as a function of processor speed and
number of clients (2 sessions/client).

Figure 4 shows the effect of increasing the number of 
clients and the processor speed while using two sessions per 
client. 

Figure 5 shows the peak throughputs achievable with any 
number of clients or number of sessions per client at each of 
five different processor speeds: 400, 450, 500, 550 and 700 
MHz. The figure shows an almost linear relationship between 
processor speed and maximum gigabit adapter throughput.
The formula for the straight line is shown, as well as the cor- 
relation coefficient. This linear relationship shows that the 
bottleneck to achieving the theoretical maximum Gigabit Eth- 
ernet performance is due to the system (the processor speed, 
and operating system and device driver software), not to the 
adapter or Ethernet switch. The rest of this article examines 
the theoretical throughput potential of the gigabit adapter, 
and then analyzes where the bottlenecks occur that prevent 
achieving this potential. 

Effects of Ethernet Overhead on Throughput 

In the tests described above, the 100,000-byte files trans- 
ferred by Chariot's filercvl script represented only a portion 
of the total bits that had to be transmitted over the Ethernet 
physical media. Some of the additional transmitted bits rep- 
resent the minimum overhead required by the Ethernet and 
TCP/IP standards to transmit the files. The Ethernet stan- 
dard defines a minimum and maximum frame size of 64 bytes 
and 1518 bytes, respectively. A large file must therefore be 
broken up into pieces that will fit inside such frames. Each 
frame must include 14 bytes for an Ethernet header and 4 
bytes for a cyclic redundancy check (CRC) for data integrity. 
Within the remaining data portion of the frame, an addition- 
al 40 bytes are required for the IP header and TCP header 
(20 bytes each). This leaves anywhere from 6 to 1460 bytes 
for the actual file data (within a 64- to 1518-byte frame). In 
addition, Ethernet requires 8 bytes for a preamble before 
each frame, as well as 12 bytes for an interframe gap between 
frames. The adapter sees the preamble and interframe gap, 
but they are not transferred across the system bus into the 
server. Figure 6 shows the data (dark areas) and 
overhead as percentages of the total available band- 
width. The bottom picture shows the overhead for a 
64-byte frame, the top one the same overhead for a 
1518-byte frame. 
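
The byte accounting in this paragraph can be written out as a short sketch. The sizes are those quoted in the text; the constant and function names are illustrative only.

```python
# Per-frame byte accounting for TCP/IP over Ethernet, using the sizes
# quoted in the text. All names are illustrative.

ETH_HEADER = 14       # Ethernet header
CRC = 4               # cyclic redundancy check
IP_TCP_HEADERS = 40   # IP header (20 bytes) + TCP header (20 bytes)
PREAMBLE = 8          # precedes each frame; not passed across the system bus
INTERFRAME_GAP = 12   # gap between frames, expressed in byte times


def payload_bytes(frame_size: int) -> int:
    """File data carried by a frame of the given size (64 to 1518 bytes)."""
    if not 64 <= frame_size <= 1518:
        raise ValueError("frame size must be between 64 and 1518 bytes")
    return frame_size - ETH_HEADER - CRC - IP_TCP_HEADERS


def wire_bytes(frame_size: int) -> int:
    """Byte times of media consumed by one frame, with preamble and gap."""
    return frame_size + PREAMBLE + INTERFRAME_GAP


if __name__ == "__main__":
    for size in (64, 1518):
        print(size, payload_bytes(size), wire_bytes(size))
```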

The percentage of bandwidth available for data 
therefore varies with Ethernet frame size (Fig. 7). 
Ethernet with TCP/IP contributes 78 bytes of 
overhead transmitted with every frame. For the 
maximum frame size of 1518 bytes, the real data 
represents 1460 out of every 1538 bytes physically 
transferred (1518 plus preamble and interframe 
gap), or 94.9 percent of the available bandwidth. 
This means that for 100 Mb/s Ethernet, the maxi- 
mum data transfer rate is 94.9 Mb/s, while for 
Gigabit Ethernet, the maximum data rate is 949 
Mb/s. Because of the Ethernet maximum frame 
size of 1518 bytes, the 100,000-byte file is broken 



into 68 packets of 1460 bytes each, and one packet con- 
taining the final 720 bytes. Therefore, the file data trans- 
ferred by Chariot's "filercvl" script would have been 
expected to come very close to the theoretical throughput 
limit of 949 Mb/s. 
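
As a check on the figures above, the following sketch recomputes the bandwidth efficiency for maximum-size frames and the way a 100,000-byte file divides into packets. It uses only the numbers given in the text; the function names are hypothetical.

```python
# Efficiency and file segmentation for maximum-size frames, using the
# figures from the text. Names are illustrative.

MAX_PAYLOAD = 1460   # file data bytes in a 1518-byte frame
MAX_WIRE = 1538      # 1518-byte frame plus preamble and interframe gap


def data_efficiency() -> float:
    """Fraction of media bandwidth that carries file data."""
    return MAX_PAYLOAD / MAX_WIRE


def max_data_rate_mbps(media_speed_mbps: float) -> float:
    """Upper bound on file data rate for a given media speed."""
    return media_speed_mbps * data_efficiency()


def segment_file(file_bytes: int) -> tuple[int, int]:
    """Return (number of full packets, bytes in the final partial packet)."""
    return divmod(file_bytes, MAX_PAYLOAD)


if __name__ == "__main__":
    print(f"efficiency: {data_efficiency():.1%}")
    print(f"100 Mb/s limit: {max_data_rate_mbps(100):.1f} Mb/s")
    print(f"1 Gb/s limit: {max_data_rate_mbps(1000):.0f} Mb/s")
    print("100,000-byte file:", segment_file(100_000))
```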

Analysis of the Data Receive Process 

In the previous section we examined the maximum theoretical 
efficiency of useful data transfers, given the overhead require- 
ments defined by the physical standards and higher-layer pro- 
tocols. In this section we examine how data is actually 
transferred from the physical Ethernet media, through the 
adapter and into the server, up to the application layer. The 
details of this transfer process therefore define the ability of 
the actual hardware, specifically the PCI bus and server pro- 
cessor, to deliver data at the stated media speed. Only the 
receive process was examined, since the throughputs and anal- 
ysis for the transmit process are similar. This analysis was 
accomplished, in part, using a logic analyzer to monitor the 
control and data transfers on the PCI bus in the server. Based 
on the analysis of the PCI bus transactions using the logic 
analyzer, together with knowledge of similar Ethernet 
adapters, we were able to determine the operational behavior 
of the Gigabit Ethernet adapter. 

As will be shown, the server handles the processing differ- 
ently for Pentium processor speeds at or below 500 MHz than 
for processor speeds at or above 550 MHz. 

When the System Processor Speed Is 
500 MHz or Below 

This analysis was based on the 7000 M10 server using 400 
MHz Pentium processors. There is some variation from frame 
to frame, depending on what else is going on in the operating 
system. The numbers shown are averages that will give a 
throughput of approximately 500 Mb/s. 

Figure 7. Throughput percentage improvement as frame size increases.

When a frame initially arrives at the adapter, the adapter
stores the data plus CRC and headers (e.g., 1518 bytes for a
maximum-size Ethernet frame) in a receive first-in first-out
(FIFO) queue. Only after enough bytes have accumulated to
exceed the FIFO's preset threshold does the adapter request
the server PCI bus to transfer the data into server memory.
The server memory has a number of receive buffers already
allocated, each of which can hold an entire 1518-byte
maximum-size Ethernet frame. There is a circular queue of
receive descriptors, each of which points to both a buffer
and the next descriptor in the queue, and also indicates
whether or not the buffer is free. When the data starts
coming into the adapter FIFO, the adapter therefore already
knows the location of the next descriptor to use, and can
burst the data into a system memory buffer using PCI
memory write and invalidate (MWI) commands. PCI MWI
transactions are used to transfer all bytes of the frame except for
the last 14 bytes, since MWI requires that data transfers
start and stop on cache line boundaries. The type of system
processor determines the cache line size. For a Pentium, the
cache line size is 32 bytes. The device driver (DD) ensures
that all of the data buffers start aligned to the cache line
boundary so that cache aligned memory accesses such as
MWI can be used effectively. Dividing a 1518-byte frame
into cache lines results in 47 cache lines to be transferred
using MWI, with a remainder of 14 bytes which must be
transferred as a separate transaction using a standard
memory write.
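
The circular queue of receive descriptors described above can be sketched as a small data structure. This is a minimal model under stated assumptions, not the adapter's or the driver's actual layout: the class names, field names, and methods are all hypothetical, and real descriptors live in memory shared between the adapter and the driver.

```python
# Minimal model of a circular receive-descriptor queue. All names are
# hypothetical; this only illustrates the structure described in the text.
from dataclasses import dataclass, field

BUFFER_SIZE = 1518  # each receive buffer holds one maximum-size frame


@dataclass
class RxDescriptor:
    buffer: bytearray = field(default_factory=lambda: bytearray(BUFFER_SIZE))
    length: int = 0        # bytes of the frame currently stored
    free: bool = True      # True while the adapter may fill this buffer
    next_index: int = 0    # link to the next descriptor in the ring


class RxRing:
    def __init__(self, size: int) -> None:
        self.descriptors = [
            RxDescriptor(next_index=(i + 1) % size) for i in range(size)
        ]
        self.head = 0  # descriptor the adapter will use next

    def receive(self, frame: bytes) -> bool:
        """Adapter side: store a frame; return False if no buffer is free."""
        desc = self.descriptors[self.head]
        if not desc.free or len(frame) > BUFFER_SIZE:
            return False
        desc.buffer[: len(frame)] = frame
        desc.length = len(frame)
        desc.free = False
        self.head = desc.next_index
        return True

    def requeue(self, index: int) -> None:
        """Driver side: hand a processed buffer back to the adapter."""
        self.descriptors[index].free = True
```

Because the adapter always knows which descriptor comes next, it can begin writing a frame as soon as the FIFO threshold is reached, provided the driver has returned that buffer in time.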

The data portion of the PCI bus in the server used for the 
tests in an earlier section was 64 bits wide and operated at 33 
MHz. During burst transfers, 64 bits (8 bytes) could therefore 
be transferred on each clock cycle (i.e., 30 ns), yielding a peak 
data rate of 64 x 33 Mb/s, or 2.112 Gb/s. Therefore, 1504 
bytes (representing a 1518-byte frame minus the last 14-byte 
fragment) could potentially be transferred in 188 (1504 divid- 
ed by 8) PCI clock cycles. However, the adapter FIFO will 
typically run out of data before the entire frame is transferred. 
This occurs because the PCI bus transfer rate is much faster 
than the gigabit channel transfer rate, and because the FIFO 
starts to transfer data to server memory using a threshold that 
is much smaller than a maximum Ethernet frame. When this 
happens, the adapter will add wait states to the transfer,
effectively slowing the PCI bus down to a 1 Gb/s transfer rate.
An average of 100 PCI clock wait states (3 µs) is added to each
transfer. The adapter then has to write the final 14 bytes using
a memory write transfer, which takes an average of 8 PCI clock
cycles (240 ns).
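
The cache-line split and the PCI burst figures quoted above follow from simple integer arithmetic, sketched here. The constants come from the text; the function names are illustrative.

```python
# Cache-line split of a frame and ideal PCI burst length, using the
# figures from the text. Names are illustrative.

CACHE_LINE = 32        # bytes, Pentium cache line size
PCI_WIDTH_BYTES = 8    # 64-bit data path
PCI_CLOCK_MHZ = 33


def mwi_split(frame_bytes: int) -> tuple[int, int]:
    """Return (cache lines sent with MWI, leftover bytes for a plain write)."""
    return divmod(frame_bytes, CACHE_LINE)


def burst_cycles(num_bytes: int) -> int:
    """PCI clock cycles needed to burst num_bytes with no wait states."""
    return -(-num_bytes // PCI_WIDTH_BYTES)  # ceiling division


def peak_rate_gbps() -> float:
    """Peak PCI data rate: bus width in bits times clock rate."""
    return PCI_WIDTH_BYTES * 8 * PCI_CLOCK_MHZ / 1000


if __name__ == "__main__":
    lines, remainder = mwi_split(1518)
    print(lines, remainder)                  # cache lines, leftover bytes
    print(burst_cycles(lines * CACHE_LINE))  # cycles for the MWI portion
    print(peak_rate_gbps())                  # Gb/s
```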

Having completed the transfer of an Ethernet frame, the adapter
then writes the status of the transfer to memory, using an
average of 17 PCI clock cycles (approximately 0.5 µs). The
adapter must wait until its FIFO has received enough of the next
frame to reach its FIFO threshold (we include all of this time as
FIFO refill because there is no other activity on the PCI bus at
that time, although there could be other system activity). This
requires an average of 330 PCI clock cycles (10 µs), but can be
overlapped with other operating system (OS) and DD operations
(to be discussed shortly). Each frame thus requires
about 19.4 µs. The process described above is then repeated
in transferring the next frame.
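
The per-frame time can be tallied from the average cycle counts given above. The sketch below assumes the five phases simply add and uses the 30 ns per PCI clock cycle conversion; with these rounded inputs the total comes to roughly 19.3 µs, close to the "about 19.4" figure in the text. The dictionary keys and function name are illustrative.

```python
# Per-frame PCI time budget for the 400 MHz case, built from the average
# cycle counts quoted in the text. Assumes the phases are simply additive.

PCI_CYCLE_NS = 30  # conversion factor: 1 PCI clock cycle = 30 ns

PHASES = {
    "MWI burst of 1504 bytes": 188,
    "wait states inserted by the adapter": 100,
    "final 14-byte memory write": 8,
    "status write": 17,
    "FIFO refill": 330,
}


def frame_time_us(phases: dict[str, int] = PHASES) -> float:
    """Total time for one frame in microseconds."""
    return sum(phases.values()) * PCI_CYCLE_NS / 1000


if __name__ == "__main__":
    for name, cycles in PHASES.items():
        print(f"{name:40s} {cycles:4d} cycles")
    print(f"total: {frame_time_us():.2f} us")
```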

Figure 8 summarizes the above description, showing how the 
frames look as they are received on the PCI bus (as observed 
using the logic analyzer). PCI clock cycles have been converted 
into time in milliseconds, microseconds, and nanoseconds, 
using the conversion factor of 1 PCI clock cycle = 30 ns. On 
average, it was observed that 210 frames (varied from 200 to 
230) were sent from the adapter to memory in a continuous 
stream from the clients. 

After a frame has been transferred to the server's system 
memory, its corresponding status write is accompanied by an 
interrupt notifying the processor that the frame has been 
received. The interrupt handler's processing of each frame 
takes slightly longer than the time required to receive the next 
frame. Therefore, given a continuous stream of frames (as was 
the case with our testing) from the first frame until the last, 
the processor is continuously executing the interrupt handler, 
looping through each received frame. The interrupt handler 
portion of the frame processing merely copies the received 
data into a separate data buffer (inside the receiving server 
memory), returns the original buffer to the adapter, and 
queues a deferred task that will eventually process the 
received frame. When the last of the burst of frames has been 
received, the interrupt handler must finish catching up with all 
of the received frames. The OS can then resume normal sys- 
tem operation and begin to dispatch the deferred tasks associ- 
ated with each of the received frames. The protocol stack can 
construct any necessary TCP/IP acknowledgments, and the 
application itself can be dispatched by the OS, at which time 
the received data can be recognized and subsequent 100,000- 
byte file requests issued. This takes, on average, 0.8 ms. This 
rate of, on average, 210 frames x 1460 bytes/frame in 4.9 ms, 
matches the throughput of 500 Mb/s. 
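
The closing arithmetic can be reproduced as follows. This sketch assumes the 4.9 ms period is the time to receive a 210-frame burst at about 19.4 µs per frame plus the 0.8 ms of follow-up processing; that decomposition is an inference from the figures in the text, and the function names are hypothetical.

```python
# Throughput of one receive burst in the 400 MHz case, from the averages
# quoted in the text. The decomposition of the 4.9 ms period is an assumption.


def burst_duration_ms(frames: int, frame_us: float, follow_up_ms: float) -> float:
    """Time to receive a burst of frames plus the follow-up processing."""
    return frames * frame_us / 1000 + follow_up_ms


def burst_throughput_mbps(frames: int, payload_bytes: int, total_ms: float) -> float:
    """File data rate in Mb/s for one burst."""
    bits = frames * payload_bytes * 8
    return bits / (total_ms * 1e-3) / 1e6


if __name__ == "__main__":
    duration = burst_duration_ms(210, 19.4, 0.8)
    print(f"burst period: {duration:.2f} ms")
    print(f"throughput: {burst_throughput_mbps(210, 1460, 4.9):.0f} Mb/s")
```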

When the System Processor Speed Is 
550 MHz or Higher 

When the speed of the system processors increases to 550
MHz, the receive process as seen on the PCI bus changes
dramatically. Instead of bursts of 210 frames, followed by a long
pause while the OS and DD catch up on previously postponed
tasks, the bursts are reduced to an average of 6 frames,
followed by a small amount of OS and DD activity. This
burstiness totally disappears when the processor speed increases to
700 MHz. The system is now apparently fast enough to do all
of the interrupt handling, buffer replacement, and other OS
and DD tasks between each frame, while the FIFO is refilling.
Figure 9 shows the receive process that gives a throughput of
approximately 754 Mb/s.

Figure 8. The data receive process when the system processor speed
is 400 MHz (OS: operating system, DD: device driver).

Figure 9. The data receive process with system processor speed of 700 MHz.

There is presently not enough information to predict what 
will happen as system processor speeds increase beyond 700 
MHz. Obviously, there is a maximum theoretical throughput 
of 949 Mb/s. The straight line relationship between processor
speed and throughput (as shown in Fig. 5) means that the
adapter has not been the limiting factor so far. This could
change if, as suspected, the FIFO requires some minimum 
amount of time to refill. 

Possibilities for Increasing Throughput When 
Processors Are Below 550 MHz 

One obvious question to ask is what impact the use of two or 
four processors had on improving system throughput. During 
the interrupt handling period, only the processor that accept- 
ed the interrupt was busy. As outlined above, it was fully con- 
sumed processing the incoming frames and is not fast enough 
to keep up. For our four-processor server, the other proces- 
sors were therefore of no benefit during this initial interrupt 
handling period. It appears that all four processors could 
simultaneously process the deferred tasks (following the com- 
pletion of the interrupt handling period), but the transmission 
of the subsequent requests required serialization through the 
protocol stack, which significantly reduced the benefit of the 
other processors. Between completion of the deferred tasks 
and receiving the next group of 6 frames, all processors were 
relatively idle. The task serialization and lack of parallelism 
among the observed processing steps, therefore, significantly 
limited the potential throughput of the four-processor system. 

There appears to be some batch-mode optimization within 
the DD. At a minimum, the requeuing of buffers back to the 
adapter is bunched into groups of approximately 18 at a time. 
It is assumed that other operations are grouped on a similar 
18-frame basis. The batch mode optimizations apparent in the 
DD probably provide a minimal increase in software efficien- 
cy. A similar batch mode process in the adapter would be a 
more significant improvement. The PCI bus efficiency would 
be greatly increased if the FIFO threshold that initiated the 
write operations was set much closer to the maximum frame 
size. This would allow the entire frame to be written across 
the PCI bus without any inserted wait states. While this would 
not improve the overall measured throughput for a single 
Gigabit Ethernet adapter, it would increase the PCI availabili- 
ty time between frames, which might then be used by other 
devices attached to the bus. Alternatively, the adapter could 



disconnect off the PCI bus, rather than insert- 
ing wait states, but this still incurs the over- 
head of starting and stopping the PCI bus 
transactions, and is limited to stopping the 
MWI operations only on cache line aligned 
boundaries. 

The requeuing of the receive buffers 
requires a write to the adapter for each buffer. 
The optimization we observed (batch mode 
grouping of these writes) could be taken one 
step further by modifying the adapter architec- 
ture so that writing a single address could 
requeue an entire collection of receive buffers. 

The interrupt processing was able to use only 
one of the system processors, since only a single 
interrupt was presented. All subsequent inter- 
rupt handler processing was done by this single 
processor, based on rechecking the receive queue as it finished 
each packet. It is not known at what point the second processor 
is able to take some of the workload. It is assumed that the pro- 
tocol processing is likewise limited to a single processor, but the 
verification of this is beyond the scope of this article. 

Observing that the interrupt handler portion of the receive 
process merely frees the receive buffers reaffirms prior knowl- 
edge that Windows NT Network Driver Interface Specifica- 
tion (NDIS) Miniport drivers used with network adapters 
require most of the packet processing work to be queued as a 
deferred task. After the interrupt processing is complete, the 
OS has, on average, 210 new tasks to launch based on the 
entries created in the deferred task queue. The overhead of 
launching these tasks, combined with the launch granularity 
(how many start at one time) and the limits on how many 
tasks can be running simultaneously, all contribute to addi- 
tional dead time during which the server is not yet processing 
the received data. If, instead, each packet were fully processed 
within the interrupt handler, it is possible that the application 
could be running in parallel on another processor, with a con- 
tinuous stream of fully processed incoming packets from the 
first processor. The trade-off would be that the processing 
would fall behind the incoming packet rate much more severe- 
ly than it did in the experiment. This would require additional 
free buffers so that the adapter would not run out before all 
incoming packets had been received. 

A significant portion of the interrupt handler is the memo- 
ry-to-memory copy performed so that the receive buffer can be 
requeued to the adapter. The prior discussion of completely 
handling each packet within the interrupt handler could elimi- 
nate the need for the data copy in cases where the data is not 
to be saved on the server, or where only a subset of the data 
needs to be copied. A further improvement that eliminates the 
necessity to copy each packet would save significant time in 
packet processing. Novell™ servers are architected such that 
no copy is performed. It is beyond the scope of this article to 
recommend how to implement such a function in this OS. 

As observed in this experiment, there is a serialization of 
functions such that there is no (or minimal) overlap between 
the three stages of processing: packet handling, protocol han- 
dling, and application processing. What is not evident is the 
split between OS/DD/application processing and true idle time 
waiting for responses from the clients. If there is no idle time, 
the system will appear to be providing maximum throughput. 
If there is idle time, the experiment is flawed in not having suf- 
ficiently low latency in the client responses to fully utilize the 
server. It is believed that the CPU utilization measured during 
this experiment demonstrates that there is no significant idle 
time at the server. Note that the OS will incorrectly count all 
time during which the processor is idle due to executing read 
instructions directly to the adapter. This is because the
processor must wait for completion of the read operation (many
microseconds). A similar effect will occur when write opera- 
tions are executed in rapid succession, forcing the processor's 
write posting buffers to fill, and likewise stalling the processor. 
The adapter in the experiment correctly uses memory reads 
and writes to avoid stalling the processor on every write. The 
batch mode processing of the receive buffer requeuing might 
be efficient enough to overrun the processor's write posting 
buffers. It is not possible from the PCI trace to determine if 
this is happening in the experiment. But as processors get 
faster, and the bus and adapter stay the same speed, this will 
eventually become a real issue. Performance improvements 
that involve deserializing the three processing steps should 
produce dramatic results. If the three time portions were 
equal, there is reason to believe that making them fully paral- 
lel operations, with the tasks executing on multiple processors, 
would triple the theoretical throughput limit of the system. 
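
The tripling claim can be illustrated with an idealized pipeline model. The sketch assumes the three stages (packet handling, protocol handling, application processing) are either run strictly one after another or fully overlapped on separate processors with no synchronization cost; neither the model nor the function names come from the article.

```python
# Idealized comparison of serialized and fully overlapped processing stages.
# Assumes zero hand-off cost between stages; names are hypothetical.
from typing import Sequence


def serialized_rate(stage_times: Sequence[float]) -> float:
    """Batches per unit time when stages run one after another."""
    return 1.0 / sum(stage_times)


def pipelined_rate(stage_times: Sequence[float]) -> float:
    """Batches per unit time when stages overlap; the slowest stage limits."""
    return 1.0 / max(stage_times)


def speedup(stage_times: Sequence[float]) -> float:
    """Throughput gain from overlapping the stages."""
    return pipelined_rate(stage_times) / serialized_rate(stage_times)


if __name__ == "__main__":
    print(speedup([1.0, 1.0, 1.0]))  # three equal stages
    print(speedup([2.0, 1.0, 1.0]))  # one stage twice as long
```

With three equal stages the model gives a factor of three; when the stages are unequal, the gain is limited by the slowest stage.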

Conclusions 

This article shows that, using the systems available today, we 
can achieve data rates approximately three-fourths the media 
speed for Gigabit Ethernet. We have shown that for typical 
commercially available multiprocessor servers, there is a linear 
relationship between server processor speed and maximum 
Ethernet throughput. We have also shown where the differ- 
ences arise between theoretical and actual throughput, and 
have suggested possible ways to decrease these differences. 